Abstract

We study estimation of the operator $\Psi$ in the linear model $Y = \Psi(X) + \varepsilon$, when $X$ and $Y$ take
values in Hilbert spaces $H_1$ and $H_2$, respectively. Our main objective is to obtain consistency
without imposing some rather inconvenient technical assumptions that have been used in the
literature. We develop our theory in a time dependent setup which includes the autoregressive
Hilbertian model as an important special case.

1 Introduction

In this paper we are concerned with a regression problem of the form 

\[ Y_k = \Psi(X_k) + \varepsilon_k, \qquad k \ge 1, \tag{1.1} \]

where $\Psi$ is a linear operator mapping from space $H_1$ to $H_2$. This model is fairly general and many
special cases have been intensively studied in the literature. Our main objective is the study of
this model when the regressor space $H_1$ is infinite dimensional. In the latter case, model (1.1) can
be seen as a general formulation of a functional linear model, which is one of the most popular
topics in functional data literature. Its various forms are introduced in Chapters 12-17 of Ramsay
and Silverman [23]. To name a few recent references we mention Cuevas et al. [9], Malfait and
Ramsay [21], Cardot et al. [5], Chiou et al. [6], Müller and Stadtmüller [22], Yao et al. [25], Cai
and Hall [3], Li and Hsing [20], Reiss and Ogden [24], Febrero-Bande et al. [12], Crambes et al. [8],
Ferraty et al. [13].

From an inferential point of view a natural problem is the estimation of the 'regression operator' $\Psi$.
This topic has been discussed from several angles. For example, Cardot et al. [1] provide consistency
results for the case of the 'functional regressor and scalar response model', while Cuevas et al. [9]
consider a 'functional regressors and responses' setup assuming a non-random design. Yao et al. [25]
also considered the 'functional regressors and responses model' but deal with the case where the
curves are not fully observed but are obtained from sparse, irregular data measured with error.
The two main methods of estimation are based on principal component analysis (e.g. Bosq [1] and
Cardot et al. [5]) or spline smoothing estimators (e.g. Hastie and Mallows [14], Marx and Eilers,
Crambes et al. [8]).

In this paper we address the estimation problem for $\Psi$ when the data are fully observed using the
principal component approach. Let us explain what distinguishes our paper from previous work.

(i) A crucial difficulty is that we are working with an infinite dimensional operator $\Psi$, which needs
to be approximated by a sample version $\hat\Psi_K$ of finite dimension $K$, say. A delicate issue is the choice
of $K$. In existing papers determination of $K$ requires very specific assumptions on the spectrum
of the covariance operator of the regressor variables. In the next section we will explain why such
assumptions pop up throughout the literature and show that consistency can be established without
any assumption on the spectrum by proposing a purely data-driven procedure for the choice of $K$.

(ii) We allow the regressors $X_k$ to be dependent. This is important for two reasons. First, many
examples in the FDA literature exhibit dependencies as the data stem from a continuous time process
which is then segmented into a sequence of curves, e.g. by considering daily data. Examples of this
kind include intraday patterns of pollution records, meteorological data, financial transaction data
or sequential fMRI recordings. See e.g. Horváth and Kokoszka [18].

Second, our framework, detailed below, will include the important special case of a functional
autoregressive model, which has been intensively investigated in the functional literature and is often
used to model autoregressive dynamics of a functional time series. This model is analyzed in detail
in Bosq [2]. We can not only greatly simplify the assumptions needed for consistent estimation but
also allow for a more general setup. E.g. in our Theorem 2.2 we show that it is not necessary to
assume that $\Psi$ is a Hilbert-Schmidt operator. This quite restrictive assumption is very often imposed
even though it excludes, for example, the identity operator.

(iii) As already mentioned, the literature considers different forms of functional linear models.
Arguably the most common are the scalar response and functional regressor case and the functional
response and functional regressor case. We will not distinguish between these cases, but work with
a linear model between two general Hilbert spaces.

In the next section we will introduce notation, assumptions, the estimator and our main results.
In Section 3 we compare the performance of the proposed estimators in a small simulation study and
finally, in Section 4 we give the proofs.



2 Estimation of $\Psi$

2.1 Notation

Let $H_1, H_2$ be two (not necessarily distinct) separable Hilbert spaces. We denote by $\mathcal{L}(H_i,H_j)$,
($i,j \in \{1,2\}$), the space of bounded linear operators from $H_i$ to $H_j$. Further we write $\langle\cdot,\cdot\rangle_H$ for
the inner product on Hilbert space $H$ and $\|x\|_H = \sqrt{\langle x,x\rangle_H}$ for the corresponding norm. For
$\Phi \in \mathcal{L}(H_i,H_j)$ we denote by $\|\Phi\|_{\mathcal{L}(H_i,H_j)} = \sup_{\|x\|_{H_i}\le 1}\|\Phi(x)\|_{H_j}$ the operator norm and by
$\|\Phi\|_{\mathcal{S}(H_i,H_j)} = \big(\sum_{k=1}^\infty \|\Phi(e_k)\|_{H_j}^2\big)^{1/2}$, where $e_1, e_2, \ldots \in H_i$ is any orthonormal basis (ONB) of $H_i$,
the Hilbert-Schmidt norm of $\Phi$. It is well known that this norm is independent of the choice of the basis.
Furthermore, with the inner product $\langle\Phi,\Theta\rangle_{\mathcal{S}(H_1,H_2)} = \sum_{k\ge1}\langle\Phi(e_k),\Theta(e_k)\rangle_{H_2}$ the space $\mathcal{S}(H_1,H_2)$ is
again a separable Hilbert space. To simplify the notation we use $\mathcal{L}_{ij}$ instead of $\mathcal{L}(H_i,H_j)$ and in
the same spirit $\mathcal{S}_{ij}$, $\|\cdot\|_{\mathcal{L}_{ij}}$, $\|\cdot\|_{\mathcal{S}_{ij}}$ and $\langle\cdot,\cdot\rangle_{\mathcal{S}_{ij}}$.

All random variables appearing in this paper will be assumed to be defined on some common
probability space $(\Omega, \mathcal{A}, P)$. A random element $X$ with values in $H$ is said to be in $L^p_H$ if
$\nu_{p,H}(X) = (E\|X\|_H^p)^{1/p} < \infty$. More conveniently we shall say that $X$ has $p$ moments. If $X$ possesses a first
moment, then $X$ possesses a mean $\mu$, determined as the unique element for which $E\langle X,x\rangle_H = \langle\mu,x\rangle_H$,
$\forall x \in H$. For $x \in H_i$ and $y \in H_j$ let $x\otimes y : H_i \to H_j$ be an operator defined as $x\otimes y(v) = \langle x,v\rangle y$.
If $X$ and $Y$ have 2 moments, then we say that $X$ and $Y$ are orthogonal ($X \perp Y$) if $E X\otimes Y = 0$.
A sequence of orthogonal elements in $H$ with a constant mean is called $H$-white noise. If $X \in L^2_H$
then it possesses a covariance operator $C$ given by $C = E[(X-\mu)\otimes(X-\mu)]$. It can be easily seen
that $C$ is a Hilbert-Schmidt operator.

2.2 Setup 



We consider the general regression problem (1.1) for fully observed data. Let us collect our main 
assumptions. 

(A): We have $\Psi \in \mathcal{L}_{12}$. Further $\{\varepsilon_k\}$ and $\{X_k\}$ are zero mean variables and are assumed to be
$L^4$-$m$-approximable in the sense of Hörmann and Kokoszka [17] (see below). In addition $\{\varepsilon_k\}$ is
$H_2$-white noise. For any $k \ge 1$ we have $X_k \perp \varepsilon_k$.

Here is the weak dependence concept that we impose. 

Definition 2.1 (Hörmann and Kokoszka [17]). A random sequence $\{X_n\}_{n\ge1}$ with values in $H$ is
called $L^p$-$m$-approximable if it can be represented as

\[ X_n = f(\delta_n, \delta_{n-1}, \delta_{n-2}, \ldots) \]

where the $\delta_i$ are iid elements taking values in a measurable space $S$ and $f$ is a measurable function
$f : S^\infty \to H$. Moreover if $\delta'_i$ are independent copies of $\delta_i$ defined on the same probability space, then
for

\[ X_n^{(m)} = f(\delta_n, \delta_{n-1}, \delta_{n-2}, \ldots, \delta_{n-m+1}, \delta'_{n-m}, \delta'_{n-m-1}, \ldots) \]

we have

\[ \sum_{m=1}^\infty \nu_{p,H}\big(X_m - X_m^{(m)}\big) < \infty. \]

The notion of $L^p$-$m$-approximability implicitly assumes that the process is stationary. Evidently
an i.i.d. sequence with finite fourth moment is $L^4$-$m$-approximable. This leads to the classical
functional linear model. However, our setup is much more general and covers, for example, the
autoregressive Hilbertian model of order 1 (ARH(1)), given by the recursion $X_{k+1} = \Psi(X_k) + \varepsilon_{k+1}$.
(See Section 2.5.) Other examples of functional time series covered in this framework can be found
in [17]. This form of weak dependence also implies that a possible non-zero mean $\mu$ of $X_k$ can be
estimated consistently by the sample mean. Moreover we have (see [16])

\[ \sqrt{n}\,\|\bar X - \mu\|_{H_1} = O_P(1). \]

We conclude that the mean can be accurately removed in a preprocessing step and that $EX_k = 0$ is
not a stringent assumption. Since by Lemma 2.1 in [17] $\{Y_k\}$ will also be $L^4$-$m$-approximable, the
same argument justifies that we study a linear model without intercept.

Our moment assumptions are mild. We are not aware of any article that works with less than 4 
moments, while for several consistency results bounded variables or finite moments of all order are 
assumed. 



2.3 The estimator 

We now explain the idea behind estimation of $\Psi$. Similarly to Bosq [1], it is based on a finite
basis approximation of the operator. To achieve optimal approximation in finite dimension, we choose
eigenfunctions of the covariance operator $C = E[X_1 \otimes X_1]$ as our basis. Another expansion based
on predictive factors has been proposed by Kargin and Onatski [19]. Here the intention is to use
directions that minimize the prediction error in the autoregressive context.

Throughout this paper we use, next to the covariance operator $C$, the cross-covariance operator
$\Delta = E[X_1 \otimes Y_1]$. By Assumption (A) both of them are Hilbert-Schmidt operators.

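Now let $(\lambda_j, v_j)_{j\ge1}$ be the eigenvalues and corresponding eigenfunctions of the operator $C$, such
that $\lambda_1 \ge \lambda_2 \ge \cdots$. The eigenfunctions are orthonormal and those belonging to a non-zero eigenvalue
form an orthonormal basis of $\overline{\mathrm{Im}(C)}$, the closure of the image $C(H_1)$. Note that with probability one
we have $X_1 \in \overline{\mathrm{Im}(C)}$. Since $\overline{\mathrm{Im}(C)}$ is again a Hilbert space, it is no real restriction to assume that
$H_1 = \overline{\mathrm{Im}(C)}$, i.e. that the operator is of full rank. In this case all eigenvalues are strictly positive.
Using linearity of $\Psi$ and the requirement $X_k \perp \varepsilon_k$ we obtain

\begin{align*}
\Delta(v_j) &= E\langle X_1, v_j\rangle_{H_1} Y_1 \\
&= E\langle X_1, v_j\rangle_{H_1} \Psi(X_1) + E\langle X_1, v_j\rangle_{H_1} \varepsilon_1 \\
&= \Psi\big(E\langle X_1, v_j\rangle_{H_1} X_1\big) \\
&= \Psi(C(v_j)) \\
&= \lambda_j \Psi(v_j).
\end{align*}

Then for any $x \in H_1$ the derived equation leads us to the representation

\[ \Psi(x) = \Psi\Big(\sum_{j=1}^\infty \langle v_j, x\rangle_{H_1} v_j\Big) = \sum_{j=1}^\infty \frac{\Delta(v_j)}{\lambda_j}\,\langle v_j, x\rangle_{H_1}. \tag{2.1} \]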



We assume here implicitly that $\dim(H_1) = \infty$. If $\dim(H_1) = M < \infty$, then (2.1) still holds with $\infty$
replaced by $M$. In fact, here the theory would become much simpler. To avoid distinguishing between
the different cases we will exclusively work in the infinite dimensional setup. Equation (2.1) gives a
core idea for estimation of $\Psi$. We will estimate $\Delta$, $v_j$ and $\lambda_j$ from our sample $X_1, \ldots, X_n, Y_1, \ldots, Y_n$
and substitute the estimators into formula (2.1). The estimated eigenelements $(\hat\lambda_{j,n}, \hat v_{j,n};\ 1 \le j \le n)$
will be obtained from the empirical covariance operator

\[ \hat C_n = \frac{1}{n}\sum_{k=1}^n X_k \otimes X_k. \]

In a similar straightforward manner we set

\[ \hat\Delta_n = \frac{1}{n}\sum_{k=1}^n X_k \otimes Y_k. \]

For ease of notation we will suppress in the sequel the dependence on the sample size $n$ of these
estimators.

Clearly, from the finite sample we cannot estimate the entire sequence $(\lambda_j, v_j)$, rather we have
to work with a truncated version. This leads to

\[ \hat\Psi_K(x) = \sum_{j=1}^K \frac{\hat\Delta(\hat v_j)}{\hat\lambda_j}\,\langle \hat v_j, x\rangle_{H_1}, \tag{2.2} \]

where the choice of $K = K_n$ is crucial. Notice that since we want our estimator to be consistent,
$K_n$ has to grow with the sample size to infinity in order to cover all the summands. On the other
hand we know that $\lambda_j \to 0$ and hence it will be a delicate issue to control the behavior of $1/\hat\lambda_j$. A
small error in the estimation of $\lambda_j$ can have an enormous impact on (2.2). The usual approach is



to relate $K_n$ to the decay rate of $\{\lambda_j\}$. For example Cardot et al. [1] assume that $n\lambda_{K_n}^4 \to \infty$ and

\[ \frac{n\lambda_{K_n}^2}{\big(\sum_{j=1}^{K_n} 1/\alpha_j\big)^2} \to \infty, \]

where

\[ \alpha_1 = \lambda_1 - \lambda_2 \quad\text{and}\quad \alpha_j = \min\{\lambda_{j-1} - \lambda_j,\ \lambda_j - \lambda_{j+1}\}, \quad j > 1. \tag{2.3} \]

Similar requirements are used in other papers (see e.g. Theorem 8.7 in [2] or Assumption (B.5) in
Yao et al. [25]). We will avoid any such assumptions by suggesting a $K_n$ that is purely data-driven.
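
The plug-in construction of this section can be sketched numerically. The following is only a minimal illustration, assuming that the curves have already been centered and expanded in a fixed orthonormal basis, so that each observation is a vector of basis coefficients. The function name `truncated_estimator` and the matrix conventions are illustrative choices and do not come from the paper.

```python
import numpy as np

def truncated_estimator(X, Y, K):
    """Matrix of the truncated plug-in estimator of Psi, in basis coordinates.

    X : (n, d1) array, rows are centered regressors X_k (basis coefficients).
    Y : (n, d2) array, rows are responses Y_k (basis coefficients).
    K : number of empirical eigenfunctions to keep (1 <= K <= d1).
    Returns M of shape (d2, d1) such that the estimate of Psi(x) is M @ x.
    """
    n = X.shape[0]
    C_hat = X.T @ X / n                 # empirical covariance operator
    D_hat = Y.T @ X / n                 # empirical cross-covariance: D_hat @ v = mean of <X_k, v> Y_k
    lam, V = np.linalg.eigh(C_hat)      # eigh returns eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]      # reorder so that lam[0] >= lam[1] >= ...
    M = np.zeros((Y.shape[1], X.shape[1]))
    for j in range(K):
        v = V[:, j]
        # contribution x -> <v_j, x> * D_hat(v_j) / lambda_j
        M += np.outer(D_hat @ v, v) / lam[j]
    return M
```

With `K` equal to the full dimension and noise-free data the sketch reduces to ordinary least squares, which gives a quick sanity check; smaller `K` discards the directions with the smallest empirical eigenvalues.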

2.4 Consistency results 



For our first result, Theorem 2.1, the sole assumptions on the spectrum and on $\Psi$ are:

(B): The eigenvalues $\{\lambda_j\}$ are mutually distinct and $\Psi$ is a Hilbert-Schmidt operator.

Assuming distinct eigenvalues is standard in functional data analysis and is commonly used for
results involving functional principal components. Without this assumption the eigenfunctions in
representation (2.1) are no longer identifiable. In Theorem 2.2 we will show that in practice we can
completely avoid Assumption (B). The $K_n$ we use in Theorem 2.1 is given as follows:

(K): Let $K_n = \min(B_n, E_n, m_n)$ where $B_n = \arg\max\{j \ge 1 \mid 1/\hat\lambda_j \le m_n\}$ and
$E_n = \arg\max\{k \ge 1 \mid \max_{1\le j\le k} 1/\hat\alpha_j \le m_n\}$ for some sequence $\{m_n\}$ such that $m_n \to \infty$ and
$m_n^6 = o(n)$. Here $\hat\lambda_j$ and $\hat\alpha_j$ are the estimates for $\lambda_j$ and $\alpha_j$ (given in (2.3)), respectively, obtained from $\hat C$.
The choice of $K_n$ is motivated by a 'bias variance trade-off' argument. If an eigenvalue is very
small (in our case $\ll 1/m_n$) it means that the direction it explains is not very important in the
representation of $X_k$. Therefore excluding it from the representation of $\Psi$ will not cause a big bias
whereas it will considerably reduce the variance. It will be only included if the sample size is big
enough, in which case we can hope for a reasonable accuracy of $\hat\lambda_j$. Since all the quantities involved
can be computed from the sample our procedure can be a fast alternative to cross-validation or
AIC type criteria as suggested in [25] for the choice of $K$ in practical applications. In practice it is
recommended to replace $1/\hat\lambda_j$ in the definition of $B_n$ by $\hat\lambda_1/\hat\lambda_j$ and $1/\hat\alpha_j$ in the definition of $E_n$ by
$\hat\lambda_1/\hat\alpha_j$ to adapt for scaling. For the asymptotics such a modification has no influence.
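
As an illustration of how a rule of this type can be evaluated from estimated eigenvalues, here is a minimal sketch of $K_n = \min(B_n, E_n, m_n)$ with the unscaled thresholds $1/\hat\lambda_j \le m_n$ and $1/\hat\alpha_j \le m_n$. Several conventions are assumptions of the sketch and not statements of the paper: the function names, returning 0 when no index qualifies, rounding $m_n$ down, and using only the left gap for the last eigenvalue in finite dimension.

```python
import numpy as np

def leading_true(mask):
    """Number of consecutive True entries at the start of a boolean array."""
    mask = np.asarray(mask, dtype=bool)
    return len(mask) if mask.all() else int(np.argmin(mask))

def choose_K(eigvals, m_n):
    """Data-driven truncation level min(B_n, E_n, m_n) from estimated eigenvalues."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # lam[0] >= lam[1] >= ...
    # B_n: largest j with 1/lam_j <= m_n (written as lam_j * m_n >= 1 to avoid division by zero)
    B = leading_true(lam * m_n >= 1.0)
    # spectral gaps: alpha_1 = lam_1 - lam_2, alpha_j = min of the two neighbouring gaps
    gaps = lam[:-1] - lam[1:]
    alpha = np.empty_like(lam)
    alpha[0] = gaps[0]
    alpha[1:-1] = np.minimum(gaps[:-1], gaps[1:])
    alpha[-1] = gaps[-1]        # assumption: the last index only has a left neighbour
    # E_n: largest k such that 1/alpha_j <= m_n for all j <= k
    E = leading_true(alpha * m_n >= 1.0)
    return min(B, E, int(np.floor(m_n)))
```

Because the eigenvalues are sorted in decreasing order, the set of indices satisfying $1/\hat\lambda_j \le m_n$ is an initial segment, so counting leading `True` entries gives the arg max directly.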



Theorem 2.1. Consider the linear Hilbertian model (1.1) and assume that Assumptions (A) and
(B) hold and $\{K_n\}$ is defined as in (K). Then the estimator described in Section 2.3 is weakly
consistent, i.e. $\|\hat\Psi_{K_n} - \Psi\|_{\mathcal{L}_{12}} \xrightarrow{P} 0$, if $n \to \infty$.

The technical Assumption (B) still appears unsatisfactory. Unfortunately it cannot be completely
avoided. To see this, assume for example that $\Psi$ is the identity operator, which is not Hilbert-Schmidt
anymore. Then for any ONB $\{v_i\}$ we have $\Psi = \sum_{i\ge1} v_i \otimes v_i$. Even if from the finite sample our
estimators for $v_1, \ldots, v_K$ would be perfect ($\hat v_i = v_i$) we have $\|\Psi - \hat\Psi_K\|_{\mathcal{L}_{12}} = 1$ for any $K \ge 1$. This
is easily seen by evaluating $\Psi$ and $\hat\Psi_K$ at $v_{K+1}$.

A way to overcome such difficulties is to argue that in practice we will be satisfied if the estimator
$\hat\Psi$ is such that $\|\Psi(X) - \hat\Psi(X)\|$ is small if $X \stackrel{d}{=} X_1$. E.g. if $\langle X, v\rangle = 0$ with probability one, then the
direction $v$ plays no role for describing $X$ and a larger value of $\|\Psi(v) - \hat\Psi(v)\|$ doesn't pose a problem
if for example we are interested in prediction. The next theorem shows that we can further simplify
the assumptions if we are only interested in showing $\|\Psi(X) - \hat\Psi(X)\|_{H_2} \xrightarrow{P} 0$. In particular, (B) can
be dropped.
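
The contrast between the two error criteria can be made concrete in finite dimension. The sketch below assumes a hypothetical covariance operator with eigenvalues $e^{-j/5}$ and a perfectly estimated rank-$K$ truncation of the identity, expressed in the eigenbasis of $C$; the dimension and the eigenvalues are illustrative choices. The operator norm error stays at one, while the mean squared prediction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

d = 35
lam = np.exp(-np.arange(d) / 5.0)      # hypothetical eigenvalues of C (illustrative only)

def truncation_errors(K):
    """Errors of the rank-K truncation of the identity, in the eigenbasis of C."""
    Psi = np.eye(d)
    Psi_K = np.zeros((d, d))
    Psi_K[:K, :K] = np.eye(K)                  # exact on the first K eigen-directions
    op_err = np.linalg.norm(Psi - Psi_K, 2)    # operator (spectral) norm of the difference
    # For zero-mean X with covariance diag(lam):
    # E||Psi(X) - Psi_K(X)||^2 = sum of lam_j over the discarded directions j > K.
    pred_err = lam[K:].sum()
    return op_err, pred_err
```

For instance, `truncation_errors(20)` returns an operator norm error of one together with a prediction error that is a small fraction of the total variance `lam.sum()`.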



Theorem 2.2. Let Assumption (A) hold and define the estimator $\hat\Psi_{K_n}$ as in Section 2.3 with
$K_n = \arg\max\{j \ge 1 \mid \hat\lambda_1/\hat\lambda_j \le m_n\}$, where $m_n = o(\sqrt{n})$. Then $\|\Psi(X) - \hat\Psi_{K_n}(X)\|_{H_2} \xrightarrow{P} 0$.

This result should be compared to Theorem 3 in Crambes and Mas [7] where an asymptotic
expansion of $E\|\Psi(X) - \hat\Psi_k(X)\|_{H_2}^2$ is obtained (for fixed $k$). Their result implies consistency, but
requires again assumptions on the decay rate of $\{\lambda_j\}$, that $\Psi$ is Hilbert-Schmidt, and the existence
of moments of all orders of the $X_k$.

It should be noted that we are studying in this paper only convergence in probability, whereas for
example [2] or [3] have also obtained results on almost sure convergence, but at the price of further
technical assumptions (e.g. boundedness of the $X_k$).

An obvious question is which rate of growth for $m_n$ is optimal (in some sense to be specified).
Though desirable, we believe that optimality results will be extremely difficult under the very general
conditions of this paper. Rates of convergence seem to require more information on the spectrum
or the operator $\Psi$ and this is exactly what we wanted to avoid here. In Section 3 we perform a
simulation study, which suggests that Theorem 2.2 will remain true if we use $m_n = O(\sqrt{n})$. In fact,
for large $n$ we see that $m_n = \sqrt{n}$ performs better than $m_n = 0.1\sqrt{n}$ throughout all constellations
that we tested.
2.5 Applications to functional time series

We show that our framework covers the ARH(1) model of Bosq [2]. With i.i.d. innovations $\delta_k \in L^p_H$
the process $\{X_k\}$ defined via $X_{k+1} = \Psi(X_k) + \delta_{k+1}$ is $L^p$-$m$-approximable if $\Psi \in \mathcal{L}(H,H)$ is such that
$\|\Psi\|_{\mathcal{L}(H,H)} < 1$, see [17]. The stationary solution for $X_k$ has the form

\[ X_k = \sum_{j\ge0} \Psi^j(\delta_{k-j}). \]

Setting $\varepsilon_k = \delta_{k+1}$ and $Y_k = X_{k+1}$ we obtain the linear model (1.1). Independence of $\{\delta_k\}$ implies that
$X_k \perp \varepsilon_k$ and hence Assumption (A) holds. Bosq [2] has obtained a (strongly) consistent estimator
of $\Psi$, if $\Psi$ is Hilbert-Schmidt and again by imposing assumptions on the spectrum of $C$.
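
For the reader's convenience, the following short calculation sketches why the contraction condition on $\Psi$ yields $L^p$-$m$-approximability for i.i.d. innovations $\delta_k$ with $\nu_{p,H}(\delta_0) < \infty$. It uses only the series representation of the stationary solution, the coupling of Definition 2.1 and Minkowski's inequality; it is an outline under these assumptions, and the complete argument is given in the reference cited above.

\begin{align*}
X_m - X_m^{(m)} &= \sum_{j \ge m} \Psi^j\big(\delta_{m-j} - \delta'_{m-j}\big), \\
\nu_{p,H}\big(X_m - X_m^{(m)}\big) &\le \sum_{j \ge m} \|\Psi\|_{\mathcal{L}(H,H)}^j \, 2\,\nu_{p,H}(\delta_0)
 = 2\,\nu_{p,H}(\delta_0)\,\frac{\|\Psi\|_{\mathcal{L}(H,H)}^m}{1 - \|\Psi\|_{\mathcal{L}(H,H)}}, \\
\sum_{m=1}^\infty \nu_{p,H}\big(X_m - X_m^{(m)}\big) &\le \frac{2\,\nu_{p,H}(\delta_0)\,\|\Psi\|_{\mathcal{L}(H,H)}}{\big(1 - \|\Psi\|_{\mathcal{L}(H,H)}\big)^2} < \infty.
\end{align*}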

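In our approach we don't even need that the innovations $\{\delta_k\}$ are i.i.d. As long as we can assure
that $\{\delta_k\}$ and $\{X_k\}$ are $L^4$-$m$-approximable we only need that $\{\delta_k\}$ is $H$-white noise. Indeed,
denoting by $A^*$ the adjoint of an operator $A$, we have for any $x \in H_1$ and $y \in H_2$ that

\begin{align*}
E\langle X_k, x\rangle_{H_1}\langle \varepsilon_k, y\rangle_{H_2} &= \sum_{j\ge0} E\langle \Psi^j(\delta_{k-j}), x\rangle_{H_1}\langle \delta_{k+1}, y\rangle_{H_2} \\
&= \sum_{j\ge0} E\langle \delta_{k-j}, (\Psi^j)^*(x)\rangle_{H_1}\langle \delta_{k+1}, y\rangle_{H_2} = 0.
\end{align*}

This shows $X_k \perp \varepsilon_k$ and Assumption (A) follows.
We obtain the following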

Corollary 2.1. Let $\{X_n\}_{n\ge1}$ be an ARH(1) process given by the recurrence equation $X_{n+1} = \Psi(X_n) +
\varepsilon_{n+1}$. Assume $\|\Psi\|_{\mathcal{L}_{12}} < 1$. If $\{\varepsilon_i\}$ is $H$-white noise and Assumption (A) holds, then for the estimator
$\hat\Psi_K$ given in Theorem 2.2 we have $\|\Psi(X) - \hat\Psi_K(X)\|_{H_2} \xrightarrow{P} 0$. In particular if $\{\varepsilon_i\}$ is i.i.d. in $L^4_H$,
Assumption (A) will hold.

Corollary 2.2. Let $\{X_n\}_{n\ge1}$ be an ARH(1) process given by the recurrence equation $X_{n+1} = \Psi(X_n) +
\varepsilon_{n+1}$. Assume $\|\Psi\|_{\mathcal{S}_{12}} < 1$. If $\{\varepsilon_i\}$ is $H$-white noise and Assumptions (A) and (B) hold, then the
estimator $\hat\Psi_K$ given in Theorem 2.1 is consistent.
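
To illustrate how the ARH(1) model fits the regression framework, the sketch below simulates a finite dimensional analogue with i.i.d. standard Gaussian innovations and estimates $\Psi$ from the pairs $(X_k, X_{k+1})$. The dimension, the innovation law, the burn-in length and the function names are assumptions made for the illustration only.

```python
import numpy as np

def simulate_arh1(Psi, n, rng, burn_in=200):
    """Simulate X_{t+1} = Psi X_t + eps_{t+1} with i.i.d. standard normal innovations."""
    d = Psi.shape[0]
    x = np.zeros(d)
    out = np.empty((n, d))
    for t in range(burn_in + n):
        x = Psi @ x + rng.normal(size=d)
        if t >= burn_in:               # discard the burn-in to get close to stationarity
            out[t - burn_in] = x
    return out

def estimate_arh1(path, K):
    """Truncated principal component estimator with Y_k = X_{k+1} as response."""
    X, Y = path[:-1], path[1:]
    n = X.shape[0]
    C_hat = X.T @ X / n                # empirical covariance
    D_hat = Y.T @ X / n                # empirical lag-one cross-covariance
    lam, V = np.linalg.eigh(C_hat)
    lam, V = lam[::-1], V[:, ::-1]     # decreasing eigenvalue order
    Vk = V[:, :K]
    # sum over j <= K of D_hat(v_j) v_j^T / lambda_j
    return (D_hat @ Vk / lam[:K]) @ Vk.T
```

A typical call is `estimate_arh1(simulate_arh1(Psi, 5000, np.random.default_rng(1)), K)` for a matrix `Psi` with operator norm below one.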



Another possible application of our result refers to a recently introduced functional version of
the celebrated ARCH model (Hörmann et al. [15]) which plays a fundamental role in financial
econometrics. It is given by the two equations

\[ y_k(t) = \varepsilon_k(t)\sigma_k(t), \qquad t \in [0,1],\ k \in \mathbb{Z} \]

and

\[ \sigma_k^2(t) = \delta(t) + \int_0^1 \beta(t,s)\, y_{k-1}^2(s)\,ds, \qquad t \in [0,1],\ k \in \mathbb{Z}. \]

Without going into details, let us just mention that one can write the squared observations
of a functional ARCH model as an autoregressive process with innovations $\nu_k(t) = y_k^2(t) - \sigma_k^2(t)$.
The new noise $\{\nu_k\}$ is no longer independent and hence the results of [2] are not applicable to prove
consistency of the involved estimator for the operator $\beta$. But it is shown in [15] that the innovations
of this new process form Hilbertian white noise and that the new process is $L^4$-$m$-approximable.
This allows us to obtain a consistent estimator for $\beta$.
3 Simulation study

Didericksen et al. [10] have investigated in an empirical study the performance of different estimators
for the regression operator $\Psi$ in the FAR(1) setup $X_{n+1} = \Psi(X_n) + \varepsilon_{n+1}$. For their study they compare
different kernel operators $\Psi$. By the smoothness of the chosen kernels, curves $X_n$ are mapped to
quite smooth and flat curves $\Psi(X_n)$, even if $X_n$ is irregularly shaped. In such a setup Didericksen
et al. [10] conclude that choosing $K = 3$ or $K = 4$ gives, broadly speaking, the best results among
all chosen setups. We have chosen operators $\Psi$ that produce more distinctive curves $\Psi(X_n)$. Our
conclusion below is that if the spectrum of $EX \otimes X$ is not decaying very fast, we need to choose $K$
much bigger than 3 or 4 to get good estimates. This is true even for moderate sample sizes. We
show that our procedure proposes $K = K_n$ that are close to the optimal $K = K^{\mathrm{OPT}}$.



3.1 Setup 

For the simulation study we obviously have to work with finite dimensional spaces $H_1$ and $H_2$.
However, because of the asymptotic nature of our results, we set the dimension relatively high
and define $H_1 = H_2 = \mathrm{span}\{v_j : 0 \le j \le 34\}$, where $v_0(t) = 1$, $v_{2k-1}(t) = \sin(2\pi kt)$ and $v_{2k}(t) =
\cos(2\pi kt)$ are the first 35 elements of a Fourier basis on $[0,1]$. We work with Gaussian curves $X_i(t)$
by setting

\[ X_i(t) = \sum_{j=1}^{35} A_i^{(j)} v_{j-1}(t), \tag{3.1} \]

where $(A_i^{(1)}, A_i^{(2)}, \ldots, A_i^{(35)})'$ are independent Gaussian random vectors with mean zero and covariance
$\Sigma$. This setup allows us to easily manipulate the eigenvalues $\{\lambda_j\}$ of the covariance operator $C_X =
EX \otimes X$. Indeed, if we define $\Sigma = \mathrm{diag}(a_1, \ldots, a_{35})$, then $\lambda_j = a_j$ and $v_{j-1}$ is the corresponding
eigenfunction. We test three sets of eigenvalues:

• $\Lambda_1$: $(1, e^{-1/5}, e^{-2/5}, \ldots, e^{-35/5})$ [fast decay],

• $\Lambda_2$: $(1, \tfrac{34}{35}, \ldots, \tfrac{1}{35})$ [slow decay],

• $\Lambda_3$: $(1, 1, \ldots, 1)$ [no decay].

The noise $\{\varepsilon_k\}$ is also assumed to be of the form (3.1) with coefficients $\{A_i^{(j)},\ i,j \ge 0\}$ i.i.d. $N(0, \sigma^2)$
and $\sigma^2 \in \{0.25, 1, 2.25, 4\}$. Finally we used the following 3 operators:

• $\Psi_1$ = identity,

• $\Psi_2 = \tau_1 + \tau_2$, such that $\tau_1 : v_i \mapsto \tfrac12 v_{\pi_i}$ and $\tau_2 : v_i \mapsto \tfrac12 v_{\pi'_i}$, where $\pi_i = 1 + (i + 4 \bmod 35)$ and
$\pi'_i = 1 + (i \bmod 35)$,

• $\Psi_3(x) = \sum_{i=1}^{35}\sum_{j=1}^{35} \psi_{ij}\langle x, v_i\rangle v_j$, where the coefficients $\psi_{ij}$ have been generated as i.i.d.
standard normal random variables (once generated, they were fixed for the entire simulation),
normalized such that $\|\Psi_3\|_{\mathcal{L}_{12}} = 1$.
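
A setup of this kind is straightforward to reproduce in coefficient space. The sketch below is a hedged illustration and not the authors' code: it builds the 35 Fourier functions as stated in the text, draws Gaussian coefficient vectors with a prescribed diagonal covariance, and constructs the identity operator and a random operator scaled to operator norm one. The fast-decay sequence is taken to be the 35 values $e^{-j/5}$, $j = 0, \ldots, 34$, which is an assumption, as are the seed, the grid and all names.

```python
import numpy as np

d = 35
rng = np.random.default_rng(0)

def fourier_basis(t):
    """Rows are v_0, ..., v_34 evaluated on the grid t."""
    rows = [np.ones_like(t)]
    for k in range(1, 18):
        rows.append(np.sin(2 * np.pi * k * t))   # v_{2k-1}
        rows.append(np.cos(2 * np.pi * k * t))   # v_{2k}
    return np.vstack(rows)

def sample_coefficients(n, variances, rng):
    """n independent Gaussian coefficient vectors with covariance diag(variances)."""
    return rng.normal(size=(n, len(variances))) * np.sqrt(variances)

fast_decay = np.exp(-np.arange(d) / 5.0)   # assumption: 35 values e^{-j/5}, j = 0, ..., 34
no_decay = np.ones(d)

Psi_1 = np.eye(d)                          # identity operator
Psi_3 = rng.normal(size=(d, d))            # random coefficients, fixed once generated
Psi_3 /= np.linalg.norm(Psi_3, 2)          # rescale to operator norm one
```

Curves on a grid `t` are then obtained as `sample_coefficients(n, fast_decay, rng) @ fourier_basis(t)`, and an operator acts on a curve by multiplying its coefficient vector.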
Figure 1 shows application of $\Psi_i$ ($i = 1,2,3$) on four realizations of the process $X$ and the
corresponding curves $\Psi_i(X)$, $\hat\Psi_i(X)$ and $Y$.

Figure 1: Columns 1 to 4 correspond to $\Psi_i(X)$, $\hat\Psi_i(X)$, $\hat\Psi_i^{\mathrm{OPT}}(X)$ and $\Psi_i(X) + \varepsilon$ ($i = 1,2,3$),
respectively. Here $\hat\Psi_i$ and $\hat\Psi_i^{\mathrm{OPT}}$ are operators given by the formula (2.2) with $K$ obtained by our
procedure and with the optimal one in terms of NMSE (see (3.2)). Estimators were computed for
$n = 1280$. The same 4 curves were used with each operator. They were drawn from a distribution
indicated by $\Lambda_1$ and they are presented at the top-left chart ($\Psi_1 = \mathrm{Id}$).



3.2 Results 

As a performance measure for our procedure we used a normalized mean square error defined as

\[ \mathrm{NMSE} = \frac{\sum_{k=1}^n \|\Psi(X_k) - \hat\Psi(X_k)\|_{H_2}^2}{\sum_{k=1}^n \|\Psi(X_k)\|_{H_2}^2}. \tag{3.2} \]

Following Theorem 2.2 we chose $m_n = c\sqrt{n}$ with the 3 different constants $c = 0.1, 0.5, 1$ and sample
sizes $n = 10 \times 2^t$, $t = 0, \ldots, 11$. The NMSE and the size of $K = K_n$ are shown for different constellations
in the Appendix. We display the results only for $\sigma = 1$. Not surprisingly, the bigger the variance
of the noise, the bigger the NMSE, but otherwise our findings were the same across all constellations of
$\sigma$. The column $K^{\mathrm{OPT}}$ shows the value of $K$ that gave the smallest NMSE among all possible values
$K = 1, \ldots, 35$.
The tables in the Appendix show that the choice of $K$ proposed by our method is very satisfactory
and close to $K^{\mathrm{OPT}}$ (which gives the smallest NMSE) or at least that the corresponding NMSEs were
comparable. We can see that for small sample sizes it is preferable to use $c = 0.1$ while for large $n$
it turns out that $c = 1$ performs best.
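
The performance measure and the search for the best truncation level can be sketched as follows. This assumes that the normalized error is the summed squared prediction error divided by the summed squared norm of the true images, and that operators are stored as matrices acting on coefficient vectors; the function names `nmse` and `best_K` are illustrative.

```python
import numpy as np

def nmse(Psi, Psi_hat, X):
    """Assumed form: sum_k ||Psi(X_k) - Psi_hat(X_k)||^2 / sum_k ||Psi(X_k)||^2.

    Psi, Psi_hat : (d2, d1) matrices; X : (n, d1) array of coefficient vectors.
    """
    target = X @ Psi.T
    pred = X @ Psi_hat.T
    return np.sum((target - pred) ** 2) / np.sum(target ** 2)

def best_K(Psi, estimators, X):
    """Return the key K of `estimators` (a dict K -> matrix) with the smallest NMSE."""
    return min(estimators, key=lambda K: nmse(Psi, estimators[K], X))
```

In a simulation, `estimators` would hold one estimated matrix per candidate value of $K$, and `best_K` then plays the role of the benchmark value against which the data-driven choice is compared.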

4 Proofs 



Throughout this entire section we assume the setup and notation of Section 2.2.

4.1 Proof of Theorem 2.1

We work under Assumptions (A) and (B) and with $K_n$ given in (K). The first important lemma
which we use in the proof of Theorem 2.1 is an error bound for the estimators of the operators $\Delta$
and $C$. Below we extend results in [17].

Lemma 4.1. There is a constant $U$ depending only on the law of $\{(X_k, Y_k)\}$ such that

\[ n \max\big\{E\|\Delta - \hat\Delta_n\|_{\mathcal{S}_{12}}^2,\ E\|C - \hat C_n\|_{\mathcal{S}_{11}}^2\big\} < U. \]



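Proof of Lemma 4.1. We only prove the bound for $\Delta$, the one for $C$ is similar. First note that by
Lemma 2.1 in [17] and Assumption (A) $\{Y_k\}$ is also $L^4$-$m$-approximable. Next we observe that

\[ nE\|\Delta - \hat\Delta_n\|_{\mathcal{S}_{12}}^2 = nE\Big\|\frac{1}{n}\sum_{k=1}^n Z_k\Big\|_{\mathcal{S}_{12}}^2, \]

where $Z_k = X_k \otimes Y_k - \Delta$. Set $Z_k^{(r)} = X_k^{(r)} \otimes Y_k^{(r)} - \Delta$. Using the stationarity of the sequence $\{Z_k\}$
we obtain

\[ nE\Big\|\frac{1}{n}\sum_{k=1}^n Z_k\Big\|_{\mathcal{S}_{12}}^2 = \sum_{|r|<n}\Big(1 - \frac{|r|}{n}\Big) E\langle Z_0, Z_r\rangle_{\mathcal{S}_{12}}
\le E\|Z_0\|_{\mathcal{S}_{12}}^2 + 2\sum_{r=1}^\infty |E\langle Z_0, Z_r\rangle_{\mathcal{S}_{12}}|. \tag{4.1} \]

By the Cauchy-Schwarz inequality and the independence of $Z_r^{(r-1)}$ and $Z_0$ we derive:

\[ |E\langle Z_0, Z_r\rangle_{\mathcal{S}_{12}}| = |E\langle Z_0, Z_r - Z_r^{(r-1)}\rangle_{\mathcal{S}_{12}}| \le \big(E\|Z_0\|_{\mathcal{S}_{12}}^2\big)^{1/2}\big(E\|Z_r - Z_r^{(r-1)}\|_{\mathcal{S}_{12}}^2\big)^{1/2}. \]

Using $\|X_0 \otimes Y_0\|_{\mathcal{S}_{12}} = \|X_0\|_{H_1}\|Y_0\|_{H_2}$ and again the Cauchy-Schwarz inequality we get

\[ E\|Z_0\|_{\mathcal{S}_{12}}^2 \le E\|X_0\|_{H_1}^2\|Y_0\|_{H_2}^2 \le \nu_{4,H_1}^2(X_0)\,\nu_{4,H_2}^2(Y_0) < \infty. \]

To finish the proof we show that $\sum_{r=1}^\infty \big(E\|Z_r - Z_r^{(r-1)}\|_{\mathcal{S}_{12}}^2\big)^{1/2} < \infty$. By using an inequality of the
type $|ab - cd|^2 \le 2|a|^2|b-d|^2 + 2|d|^2|a-c|^2$ we obtain

\begin{align*}
E\|Z_r - Z_r^{(r-1)}\|_{\mathcal{S}_{12}}^2 &= E\|X_r \otimes Y_r - X_r^{(r-1)} \otimes Y_r^{(r-1)}\|_{\mathcal{S}_{12}}^2 \\
&\le 2E\|X_r\|_{H_1}^2\|Y_r - Y_r^{(r-1)}\|_{H_2}^2 + 2E\|Y_r^{(r-1)}\|_{H_2}^2\|X_r - X_r^{(r-1)}\|_{H_1}^2 \\
&\le 2\nu_{4,H_1}^2(X_r)\,\nu_{4,H_2}^2\big(Y_r - Y_r^{(r-1)}\big) + 2\nu_{4,H_2}^2\big(Y_r^{(r-1)}\big)\,\nu_{4,H_1}^2\big(X_r - X_r^{(r-1)}\big).
\end{align*}

Convergence of (4.1) follows now directly from $L^4$-$m$-approximability. □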



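Application of this lemma leads also to bounds for estimators of eigenvalues and eigenfunctions
of $C$ via the following two lemmas (see [17]).

Lemma 4.2. Suppose $\lambda_j$, $\hat\lambda_j$ are the eigenvalues of $C$ and $\hat C$, respectively, listed in decreasing order.
Let $v_j$, $\hat v_j$ be the corresponding eigenvectors and let $\hat c_j = \mathrm{sign}\langle v_j, \hat v_j\rangle$. Then for each $j \ge 1$,

\[ \hat\alpha_j \|v_j - \hat c_j \hat v_j\|_{H_1} \le 2\sqrt{2}\,\|\hat C - C\|_{\mathcal{L}_{11}}, \]

where $\hat\alpha_j = \min\{\hat\lambda_{j-1} - \hat\lambda_j,\ \hat\lambda_j - \hat\lambda_{j+1}\}$ and $\hat\alpha_1 = \hat\lambda_1 - \hat\lambda_2$.

Lemma 4.3. Let $\lambda_j$, $\hat\lambda_j$ be defined as in Lemma 4.2. Then for each $j \ge 1$,

\[ |\lambda_j - \hat\lambda_j| \le \|C - \hat C\|_{\mathcal{L}_{11}}. \]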
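
In finite dimension the eigenvalue bound is the classical Weyl inequality for symmetric matrices, and it can be checked numerically. The sketch below is only an illustration with a hypothetical diagonal covariance matrix and a simulated sample; the helper name `eigenvalue_perturbation` is an assumption.

```python
import numpy as np

def eigenvalue_perturbation(C, C_hat):
    """Return (max_j |lambda_j - lambda_hat_j|, operator norm of C - C_hat)."""
    lam = np.sort(np.linalg.eigvalsh(C))[::-1]          # eigenvalues in decreasing order
    lam_hat = np.sort(np.linalg.eigvalsh(C_hat))[::-1]
    max_diff = np.max(np.abs(lam - lam_hat))
    op_norm = np.linalg.norm(C - C_hat, 2)              # spectral norm
    return max_diff, op_norm
```

For any pair of symmetric matrices the first returned value does not exceed the second, up to floating point error.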



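In the following calculations we work with finite sums of the representation in (2.1):

\[ \Psi_K(x) = \sum_{j=1}^K \frac{\Delta(v_j)}{\lambda_j}\,\langle v_j, x\rangle_{H_1}. \tag{4.2} \]

In order to prove the main result we consider the term $\|\Psi - \hat\Psi_K\|_{\mathcal{L}_{12}}$ and decompose it using the
triangle inequality into four terms

\[ \|\Psi - \hat\Psi_K\|_{\mathcal{L}_{12}} \le \sum_{i=1}^4 \|S_i(K)\|_{\mathcal{L}_{12}}, \]

where

\begin{align}
S_1(K) &= \sum_{j=1}^K \Big(\hat c_j\hat v_j \otimes \frac{\hat\Delta(\hat c_j\hat v_j)}{\hat\lambda_j} - \hat c_j\hat v_j \otimes \frac{\Delta(\hat c_j\hat v_j)}{\hat\lambda_j}\Big), \tag{4.3} \\
S_2(K) &= \sum_{j=1}^K \Big(\hat c_j\hat v_j \otimes \frac{\Delta(\hat c_j\hat v_j)}{\hat\lambda_j} - \hat c_j\hat v_j \otimes \frac{\Delta(\hat c_j\hat v_j)}{\lambda_j}\Big), \tag{4.4} \\
S_3(K) &= \sum_{j=1}^K \Big(\hat c_j\hat v_j \otimes \frac{\Delta(\hat c_j\hat v_j)}{\lambda_j} - v_j \otimes \frac{\Delta(v_j)}{\lambda_j}\Big), \tag{4.5} \\
S_4(K) &= \Psi - \Psi_K. \tag{4.6}
\end{align}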



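The following simple lemma gives convergence of $S_4(K_n)$, provided $K_n \xrightarrow{P} \infty$.

Lemma 4.4. Let $\{K_n, n \ge 1\}$ be a random sequence taking values in $\mathbb{N}$, such that $K_n \xrightarrow{P} \infty$ as
$n \to \infty$. Then $\Psi_{K_n}$ defined by the equation (4.2) converges to $\Psi$ in probability.

Proof. Notice that since $\|\Psi\|_{\mathcal{S}_{12}}^2 = \sum_{j=1}^\infty \|\Psi(v_j)\|_{H_2}^2 < \infty$ for some orthonormal base $\{v_j\}$, we can find
$m_\varepsilon \in \mathbb{N}$ such that $\|\Psi - \Psi_m\|_{\mathcal{S}_{12}}^2 = \sum_{j>m} \|\Psi(v_j)\|_{H_2}^2 \le \varepsilon$, whenever $m > m_\varepsilon$. Hence

\[ P\big(\|\Psi - \Psi_{K_n}\|_{\mathcal{S}_{12}}^2 > \varepsilon\big) = \sum_{m=1}^\infty P\big(\|\Psi - \Psi_m\|_{\mathcal{S}_{12}}^2 > \varepsilon \cap K_n = m\big) \le P(K_n \le m_\varepsilon) \to 0. \]

□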



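The next three lemmas deal with terms (4.3)-(4.5).

Lemma 4.5. Let $S_1(K)$ be defined by the equation (4.3) and $U$ the constant derived in Lemma 4.1.
Then

\[ P\big(\|S_1(K_n)\|_{\mathcal{L}_{12}} > \varepsilon\big) \le U\,\frac{m_n^2}{\varepsilon^2 n}. \]

Proof. Note that for an orthonormal system $\{e_i \in H_1 \mid i \ge 1\}$ and any sequence $\{x_i \in H_2 \mid i \ge 1\}$
the following identity holds:

\[ \Big\|\sum_{i=1}^K e_i \otimes x_i\Big\|_{\mathcal{S}_{12}}^2 = \sum_{i=1}^K \|x_i\|_{H_2}^2. \tag{4.7} \]

Using this, the fact that the Hilbert-Schmidt norm bounds the operator norm, and $1/\hat\lambda_j \le m_n$ for $j \le K_n \le B_n$, we derive

\begin{align*}
P\big(\|S_1(K_n)\|_{\mathcal{L}_{12}} > \varepsilon\big) &\le P\Big(\Big\|\sum_{j=1}^{K_n} \frac{1}{\hat\lambda_j}\,\hat c_j\hat v_j \otimes (\hat\Delta - \Delta)(\hat c_j\hat v_j)\Big\|_{\mathcal{S}_{12}}^2 > \varepsilon^2\Big) \\
&= P\Big(\sum_{j=1}^{K_n} \frac{1}{\hat\lambda_j^2}\,\|(\hat\Delta - \Delta)(\hat v_j)\|_{H_2}^2 > \varepsilon^2\Big) \\
&\le P\big(m_n^2\,\|\hat\Delta - \Delta\|_{\mathcal{S}_{12}}^2 > \varepsilon^2\big).
\end{align*}

By the Markov inequality

\[ P\big(\|S_1(K_n)\|_{\mathcal{L}_{12}} > \varepsilon\big) \le E\|\hat\Delta - \Delta\|_{\mathcal{S}_{12}}^2\,\frac{m_n^2}{\varepsilon^2} \le U\,\frac{m_n^2}{\varepsilon^2 n}, \]

where the last inequality is obtained from Lemma 4.1. □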



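Lemma 4.6. Let $S_2(K)$ be defined by the equation (4.4) and $U$ the constant from Lemma 4.1. Then

\[ P\big(\|S_2(K_n)\|_{\mathcal{L}_{12}} > \varepsilon\big) \le 4U\,\|\Delta\|_{\mathcal{S}_{12}}^2\,\frac{m_n^4}{\varepsilon^2 n}. \]

Proof. Assumption $K_n \le B_n$ and identity (4.7) imply

\begin{align*}
P\big(\|S_2(K_n)\|_{\mathcal{L}_{12}} > \varepsilon\big) &\le P\Big(\Big\|\sum_{j=1}^{K_n}\Big(\frac{1}{\hat\lambda_j} - \frac{1}{\lambda_j}\Big)\hat c_j\hat v_j \otimes \Delta(\hat c_j\hat v_j)\Big\|_{\mathcal{S}_{12}}^2 > \varepsilon^2\Big) \\
&\le P\Big(\max_{1\le j\le K_n}\Big|\frac{\hat\lambda_j - \lambda_j}{\hat\lambda_j\lambda_j}\Big|^2 \sum_{j=1}^{K_n}\|\Delta(\hat v_j)\|_{H_2}^2 > \varepsilon^2\Big) \\
&\le P\Big(\max_{1\le j\le K_n}\Big|\frac{\hat\lambda_j - \lambda_j}{\lambda_j}\Big| > \frac{\varepsilon}{m_n\|\Delta\|_{\mathcal{S}_{12}}}\Big).
\end{align*}

For simplifying the notation let $b_n = \varepsilon/(m_n\|\Delta\|_{\mathcal{S}_{12}})$, then

\begin{align*}
P\big(\|S_2(K_n)\|_{\mathcal{L}_{12}} > \varepsilon\big) &\le P\Big(\max_{1\le j\le K_n}\frac{|\hat\lambda_j - \lambda_j|}{\lambda_j} > b_n\Big) \\
&\le P\Big(\max_{1\le j\le K_n}\frac{|\hat\lambda_j - \lambda_j|}{\lambda_j} > b_n \cap \max_{1\le j\le K_n}|\hat\lambda_j - \lambda_j| < \frac{b_n}{2m_n}\Big) \\
&\quad + P\Big(\max_{1\le j\le K_n}|\hat\lambda_j - \lambda_j| \ge \frac{b_n}{2m_n}\Big).
\end{align*}

The first summand vanishes because

\begin{align*}
P\Big(\max_{1\le j\le K_n}\frac{|\hat\lambda_j - \lambda_j|}{\lambda_j} > b_n \cap \max_{1\le j\le K_n}|\hat\lambda_j - \lambda_j| < \frac{b_n}{2m_n}\Big)
&\le P\Big(\frac{b_n}{2\lambda_{K_n}m_n} > b_n \cap |\hat\lambda_{K_n} - \lambda_{K_n}| < \frac{b_n}{2m_n}\Big) \\
&= P\Big(\frac{1}{2m_n} > \lambda_{K_n} \cap |\hat\lambda_{K_n} - \lambda_{K_n}| < \frac{b_n}{2m_n}\Big),
\end{align*}

which is equal to 0 for $n$ large enough, since $\hat\lambda_{K_n} \ge \frac{1}{m_n}$ and the distance between $\hat\lambda_{K_n}$ and $\lambda_{K_n}$
shrinks faster than $\frac{1}{2m_n}$. For the second term we use Lemma 4.3 and the Markov inequality:

\begin{align*}
P\Big(\max_{1\le j\le K_n}|\hat\lambda_j - \lambda_j| \ge \frac{b_n}{2m_n}\Big) &\le P\Big(\|\hat C - C\|_{\mathcal{L}_{11}} \ge \frac{b_n}{2m_n}\Big) \\
&\le \frac{4m_n^2}{b_n^2}\,E\|\hat C - C\|_{\mathcal{L}_{11}}^2 \\
&\le 4U\,\|\Delta\|_{\mathcal{S}_{12}}^2\,\frac{m_n^4}{\varepsilon^2 n}.
\end{align*}

□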



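Lemma 4.7. Let $S_3(K)$ be defined by (4.5) and $U$ be the constant defined in Lemma 4.1, then

\[ P\big(\|S_3(K_n)\|_{\mathcal{L}_{12}} > \varepsilon\big) \le U\big(128\|\Delta\|_{\mathcal{L}_{12}}^2 + 4\varepsilon^2\big)\frac{m_n^6}{\varepsilon^2 n}. \]

Proof. By adding and subtracting the term $\hat c_j\hat v_j \otimes \Delta(v_j)$ and using the triangle inequality we derive

\begin{align*}
P\big(\|S_3(K_n)\|_{\mathcal{L}_{12}} > \varepsilon\big) &= P\Big(\Big\|\sum_{j=1}^{K_n}\frac{1}{\lambda_j}\big(\hat c_j\hat v_j \otimes \Delta(\hat c_j\hat v_j) - v_j \otimes \Delta(v_j)\big)\Big\|_{\mathcal{L}_{12}} > \varepsilon\Big) \\
&\le P\Big(\sum_{j=1}^{K_n}\frac{1}{\lambda_j}\big\|\hat c_j\hat v_j \otimes \Delta(\hat c_j\hat v_j - v_j) + (\hat c_j\hat v_j - v_j) \otimes \Delta(v_j)\big\|_{\mathcal{L}_{12}} > \varepsilon\Big) \\
&\le P\Big(\sum_{j=1}^{K_n}\frac{1}{\lambda_j}\big(\|\Delta\|_{\mathcal{L}_{12}}\|\hat c_j\hat v_j - v_j\|_{H_1} + \|\hat c_j\hat v_j - v_j\|_{H_1}\|\Delta\|_{\mathcal{L}_{12}}\big) > \varepsilon\Big).
\end{align*}

Now we split $\Omega = A \cup A^c$ where $A = \{\frac{1}{\lambda_{K_n}} > 2m_n\}$ and get

\[ P\big(\|S_3(K_n)\|_{\mathcal{L}_{12}} > \varepsilon\big) \le P\Big(\sum_{j=1}^{K_n}\|\hat c_j\hat v_j - v_j\|_{H_1} > \frac{\varepsilon}{4m_n\|\Delta\|_{\mathcal{L}_{12}}}\Big) + P\Big(\frac{1}{\lambda_{K_n}} > 2m_n\Big). \tag{4.8} \]

For the first term in the inequality (4.8), by Lemma 4.2, the definition of $E_n$ and the Markov inequality
we get

\begin{align*}
P\Big(\sum_{j=1}^{K_n}\|\hat c_j\hat v_j - v_j\|_{H_1} > \frac{\varepsilon}{4m_n\|\Delta\|_{\mathcal{L}_{12}}}\Big)
&\le P\Big(m_n\max_{1\le j\le K_n}\|\hat c_j\hat v_j - v_j\|_{H_1} > \frac{\varepsilon}{4m_n\|\Delta\|_{\mathcal{L}_{12}}}\Big) \\
&\le P\Big(\max_{1\le j\le K_n}\frac{2\sqrt{2}}{\hat\alpha_j}\,\|\hat C - C\|_{\mathcal{L}_{11}} > \frac{\varepsilon}{4m_n^2\|\Delta\|_{\mathcal{L}_{12}}}\Big) \\
&\le P\Big(\|\hat C - C\|_{\mathcal{L}_{11}} > \frac{\varepsilon}{8\sqrt{2}\,m_n^3\|\Delta\|_{\mathcal{L}_{12}}}\Big) \\
&\le 128\,\|\Delta\|_{\mathcal{L}_{12}}^2\,\frac{m_n^6}{\varepsilon^2}\,E\|\hat C - C\|_{\mathcal{L}_{11}}^2 \\
&\le 128\,U\,\|\Delta\|_{\mathcal{L}_{12}}^2\,\frac{m_n^6}{\varepsilon^2 n}.
\end{align*}

Since $\hat\lambda_{K_n} \ge \frac{1}{m_n}$, the second term in the inequality (4.8) is bounded by

\begin{align*}
P\Big(\lambda_{K_n} < \frac{1}{2m_n}\Big) &\le P\Big(\lambda_{K_n} < \frac{1}{2m_n} \cap |\hat\lambda_{K_n} - \lambda_{K_n}| \le \frac{1}{2m_n}\Big) + P\Big(|\hat\lambda_{K_n} - \lambda_{K_n}| > \frac{1}{2m_n}\Big) \\
&\le P\Big(\|\hat C - C\|_{\mathcal{L}_{11}} > \frac{1}{2m_n}\Big) \\
&\le 4m_n^2\,E\|\hat C - C\|_{\mathcal{L}_{11}}^2 \le 4U\,\frac{m_n^2}{n}.
\end{align*}

Thus we derive

\[ P\big(\|S_3(K_n)\|_{\mathcal{L}_{12}} > \varepsilon\big) \le 128\,U\,\|\Delta\|_{\mathcal{L}_{12}}^2\,\frac{m_n^6}{\varepsilon^2 n} + 4U\,\frac{m_n^2}{n} \le U\big(128\|\Delta\|_{\mathcal{L}_{12}}^2 + 4\varepsilon^2\big)\frac{m_n^6}{\varepsilon^2 n}. \]

□

Finally we need a lemma which assures that $K_n$ tends to infinity.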

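Lemma 4.8. Let $K_n$ be defined as in (K), then $K_n \xrightarrow{P} \infty$.

Proof. We have to show that $P(\min\{B_n, E_n\} < p) \to 0$ for any $p \in \mathbb{N}$. Since $\frac{1}{m_n} \searrow 0$, for $n$ large
enough we have, by combining Lemma 4.1 and 4.3, that

\[ P(B_n < p) = P\Big(\hat\lambda_p < \frac{1}{m_n}\Big) = P\Big(\lambda_p - \hat\lambda_p > \lambda_p - \frac{1}{m_n}\Big) \le P\Big(|\lambda_p - \hat\lambda_p| > \lambda_p - \frac{1}{m_n}\Big) \to 0. \]

□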



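Now we are ready to prove the main result.

Proof of Theorem 2.1. First, by the triangle inequality we get

\[ \|\Psi - \hat\Psi_{K_n}\|_{\mathcal{L}_{12}} \le \|S_1(K_n)\|_{\mathcal{L}_{12}} + \|S_2(K_n)\|_{\mathcal{L}_{12}} + \|S_3(K_n)\|_{\mathcal{L}_{12}} + \|\Psi - \Psi_{K_n}\|_{\mathcal{L}_{12}}. \]

By Lemmas 4.4, 4.5, 4.6, 4.7 and the assumption $m_n^6 = o(n)$ we finally obtain for large enough $n$ that

\begin{align*}
P\big(\|\Psi - \hat\Psi_{K_n}\|_{\mathcal{L}_{12}} > \varepsilon\big)
&\le 4^2 U\,\frac{m_n^2}{\varepsilon^2 n} + 4^3 U\,\|\Delta\|_{\mathcal{S}_{12}}^2\,\frac{m_n^4}{\varepsilon^2 n} + 4^2 U\big(128\|\Delta\|_{\mathcal{L}_{12}}^2 + \varepsilon^2/4\big)\frac{m_n^6}{\varepsilon^2 n} \\
&\quad + P\big(\|\Psi - \Psi_{K_n}\|_{\mathcal{L}_{12}} > \varepsilon/4\big) \to 0.
\end{align*}

□

4.2 Proof of Theorem 2.2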

In order to simplify the notation we will denote K = K n . This time as a starting point we take a 
representation of * in the basis #2, ...}. Let M m = sp{f i, t>2, « m }, M m = sp{vi, $2, v m } 
where sp{xj, i 6 J} denotes the closed span of the elements {xi, i 6 J}. If rank((7) = £, then 
i > £} can be any ONB of ■ We write Pa for the projection operator which maps on a closed 
linear space A. As usual A 1 - denotes the orthogonal complement of A. Since for any m > 1 we can 
write x = P^ (x) + P^j ± (x), the linearity of * and the projection operator gives 

m 

= J2(vj,x) Hl ^{vj) + ^(PM±(x)). 
j'=i 

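Now we evaluate $\Psi$ in some $\hat v_j$ which is not in the kernel of $\hat C$. By the definitions of $\Psi$ and $\hat C$, and again by linearity of the involved operators,

$$\Psi(\hat v_j) = \frac{1}{\hat\lambda_j}\Psi\big(\hat C(\hat v_j)\big) = \frac{1}{\hat\lambda_j}\,\frac{1}{n}\sum_{i=1}^{n} \langle X_i, \hat v_j\rangle_{H_1} \Psi(X_i) = \frac{1}{\hat\lambda_j}\Big(\frac{1}{n}\sum_{i=1}^{n} \langle X_i, \hat v_j\rangle_{H_1} Y_i - \Lambda(\hat v_j)\Big),$$

where $\Lambda = \frac{1}{n}\sum_{i=1}^{n} X_i \otimes \varepsilon_i$. Hence if $m$ is such that $\hat\lambda_m > 0$ (which will now be implicitly assumed in the sequel), $\Psi$ can be expressed as

$$\Psi(x) = \sum_{j=1}^{m} \frac{1}{\hat\lambda_j}\Big(\frac{1}{n}\sum_{i=1}^{n} \langle X_i, \hat v_j\rangle_{H_1} Y_i\Big)\langle \hat v_j, x\rangle_{H_1} - \sum_{j=1}^{m} \frac{\Lambda(\hat v_j)}{\hat\lambda_j}\langle \hat v_j, x\rangle_{H_1} + \Psi\big(P_{\hat M_m^{\perp}}(x)\big).$$

Note that the first term on the right-hand side is just $\hat\Psi_m(x)$. Therefore for any $x$, the distance between $\Psi(x)$ and $\hat\Psi_m(x)$ takes the following form

$$\|\Psi(x) - \hat\Psi_m(x)\|_{H_2} = \Big\|\sum_{j=1}^{m} \frac{\Lambda(\hat v_j)}{\hat\lambda_j}\langle \hat v_j, x\rangle_{H_1} - \Psi\big(P_{\hat M_m^{\perp}}(x)\big)\Big\|_{H_2}. \qquad (4.9)$$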
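As a sanity check on the representation behind (4.9), the following finite-dimensional sketch verifies numerically that $\Psi(x) = \hat\Psi_m(x) - \sum_{j \le m} \hat\lambda_j^{-1}\Lambda(\hat v_j)\langle \hat v_j, x\rangle + \Psi(P_{\hat M_m^{\perp}}(x))$. It is only an illustration under assumptions that are not part of the paper: $H_1 = \mathbb{R}^{d_1}$, $H_2 = \mathbb{R}^{d_2}$, Gaussian data, and the uncentered empirical covariance $\hat C = n^{-1}\sum_i X_i \otimes X_i$. The function name and all parameter values are hypothetical.

```python
import numpy as np


def check_decomposition(n=50, d1=6, d2=4, m=3, seed=0):
    """Check Psi(x) = Psi_hat_m(x) - Theta(x) + Psi(P_perp x) in finite dimensions."""
    rng = np.random.default_rng(seed)
    Psi = rng.standard_normal((d2, d1))      # true operator (illustrative choice)
    X = rng.standard_normal((n, d1))         # regressors, one per row
    eps = rng.standard_normal((n, d2))       # noise, one per row
    Y = X @ Psi.T + eps                      # model Y_i = Psi(X_i) + eps_i

    C_hat = X.T @ X / n                      # uncentered empirical covariance (assumption)
    evals, evecs = np.linalg.eigh(C_hat)     # eigh returns ascending eigenvalues
    lam = evals[::-1][:m]                    # m largest empirical eigenvalues
    V = evecs[:, ::-1][:, :m]                # matching empirical eigenvectors

    # sum_j v_j v_j^T / lam_j, the regularised inverse of C_hat on span{v_1..v_m}
    shrink = V @ np.diag(1.0 / lam) @ V.T
    Psi_hat_m = (Y.T @ X / n) @ shrink       # estimator truncated at level m
    Theta = (eps.T @ X / n) @ shrink         # noise term built from Lambda
    proj_perp = np.eye(d1) - V @ V.T         # projection onto the orthogonal complement

    x = rng.standard_normal(d1)
    lhs = Psi @ x
    rhs = Psi_hat_m @ x - Theta @ x + Psi @ (proj_perp @ x)
    return bool(np.allclose(lhs, rhs))
```

Calling `check_decomposition()` should return `True` up to floating-point error whenever the $m$-th empirical eigenvalue is strictly positive, which is the standing assumption in the text.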



To assess (4.9) we need the following four lemmas. 



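Lemma 4.9. Let $(\lambda_j, v_j)_{j \ge 1}$ and $(\hat\lambda_j, \hat v_j)_{j \ge 1}$ be eigenvalues and eigenfunctions of $C$ and $\hat C$, respectively. Let $j, m \in \mathbb{N}$ be such that $j \le m < n$. Then

$$\|v_j - P_{\hat M_m}(v_j)\|_{H_1}^2 \le \frac{4\,\|\hat C - C\|_{\mathcal{L}_{11}}^2}{(\hat\lambda_{m+1} - \hat\lambda_j)^2}.$$

Proof. Note that by using Parseval's identity we get

$$\|v_j - P_{\hat M_m}(v_j)\|_{H_1}^2 = \sum_{k=1}^{\infty} \langle v_j - P_{\hat M_m}(v_j), \hat v_k\rangle_{H_1}^2 = \sum_{k>m} \langle v_j, \hat v_k\rangle_{H_1}^2.$$

Now

$$(\hat\lambda_{m+1} - \hat\lambda_j)^2 \sum_{k>m} \langle v_j, \hat v_k\rangle_{H_1}^2 \le \sum_{k>m} \big(\hat\lambda_k \langle v_j, \hat v_k\rangle_{H_1} - \hat\lambda_j \langle v_j, \hat v_k\rangle_{H_1}\big)^2.$$

Since $\hat C$ is a self-adjoint operator, simple algebraic transformations yield

$$(\hat\lambda_{m+1} - \hat\lambda_j)^2 \sum_{k>m} \langle v_j, \hat v_k\rangle_{H_1}^2 \le \sum_{k>m} \big(\langle \hat C(v_j), \hat v_k\rangle_{H_1} - \hat\lambda_j \langle v_j, \hat v_k\rangle_{H_1}\big)^2$$

$$= \sum_{k>m} \big(\langle (\hat C - C)(v_j), \hat v_k\rangle_{H_1} - (\hat\lambda_j - \lambda_j)\langle v_j, \hat v_k\rangle_{H_1}\big)^2$$

$$\le 2\sum_{k>m} \langle (\hat C - C)(v_j), \hat v_k\rangle_{H_1}^2 + 2(\hat\lambda_j - \lambda_j)^2 \sum_{k>m} \langle v_j, \hat v_k\rangle_{H_1}^2.$$

By Parseval's inequality and Lemma 4.3

$$(\hat\lambda_{m+1} - \hat\lambda_j)^2 \sum_{k>m} \langle v_j, \hat v_k\rangle_{H_1}^2 \le 2\|(\hat C - C)(v_j)\|_{H_1}^2 + 2|\hat\lambda_j - \lambda_j|^2 \le 4\|\hat C - C\|_{\mathcal{L}_{11}}^2.$$

□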
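The inequality that the proof of Lemma 4.9 arrives at, $(\hat\lambda_{m+1} - \hat\lambda_j)^2\,\|v_j - P_{\hat M_m}(v_j)\|^2 \le 4\|\hat C - C\|^2$, is deterministic, so it can be checked numerically. The sketch below does this in $\mathbb{R}^d$ under assumptions chosen only for illustration: a diagonal covariance with eigenvalues $j^{-2}$, Gaussian samples, the spectral norm for $\|\hat C - C\|$, and a hypothetical function name. The multiplied-out form avoids dividing by a vanishing eigengap.

```python
import numpy as np


def lemma_4_9_gap(n=200, d=8, seed=0):
    """Return the largest value of LHS - RHS over all admissible (j, m).

    LHS = (lam_hat_{m+1} - lam_hat_j)^2 * ||v_j - P_{M_hat_m} v_j||^2
    RHS = 4 * ||C_hat - C||^2 (spectral norm).
    A non-positive return value means the bound held in every case.
    """
    rng = np.random.default_rng(seed)
    true_lam = 1.0 / np.arange(1, d + 1) ** 2      # illustrative eigenvalues of C
    C = np.diag(true_lam)                          # eigenvectors v_j are coordinate axes
    X = rng.standard_normal((n, d)) * np.sqrt(true_lam)
    C_hat = X.T @ X / n                            # uncentered empirical covariance

    evals, evecs = np.linalg.eigh(C_hat)           # ascending order
    lam_hat = evals[::-1]                          # descending empirical eigenvalues
    V_hat = evecs[:, ::-1]                         # matching empirical eigenvectors

    err_sq = np.linalg.norm(C_hat - C, 2) ** 2     # squared spectral norm
    worst = -np.inf
    for m in range(1, d):                          # lam_hat[m] is lam_hat_{m+1}
        basis = V_hat[:, :m]                       # spans M_hat_m
        for j in range(1, m + 1):
            v_j = np.zeros(d)
            v_j[j - 1] = 1.0
            resid = v_j - basis @ (basis.T @ v_j)  # v_j minus its projection
            lhs = (lam_hat[m] - lam_hat[j - 1]) ** 2 * (resid @ resid)
            worst = max(worst, lhs - 4.0 * err_sq)
    return worst
```

A return value that is negative (or zero up to rounding) indicates that the bound holds for every pair $j \le m$ in the simulated sample.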
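Lemma 4.10. Let $X$ be defined as in Lemma 2.2 and $K = K_n \xrightarrow{P} \infty$. Then $\|P_{M_K^{\perp}}(X)\|_{H_1} \xrightarrow{P} 0$.

Proof. We first remark that for any $\varepsilon > 0$

$$P\big(\|P_{M_K^{\perp}}(X)\|_{H_1}^2 > \varepsilon\big) = P\Big(\sum_{i=K+1}^{\infty} |\langle v_i, X\rangle_{H_1}|^2 > \varepsilon\Big).$$

Since $\sum_{i \ge 1} |\langle v_i, X\rangle_{H_1}|^2 = \|X\|_{H_1}^2$, there exists a random variable $J_\varepsilon \in \mathbb{N}$ such that $\sum_{i > J_\varepsilon} |\langle v_i, X\rangle_{H_1}|^2 < \varepsilon$. Since by assumption $\|X\|_{H_1} < \infty$, we conclude that $J_\varepsilon$ is bounded in probability. Hence we obtain

$$P\big(\|P_{M_K^{\perp}}(X)\|_{H_1}^2 > \varepsilon\big) \le P\Big(\sum_{i=K+1}^{\infty} |\langle v_i, X\rangle_{H_1}|^2 > \varepsilon \,\cap\, K > J_\varepsilon\Big) + P(K \le J_\varepsilon) = P(K \le J_\varepsilon),$$

where the last term converges to zero as $n \to \infty$.

□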



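Lemma 4.11. Let $L_n = \arg\max\{r \le K : \sum_{j=1}^{r} (\hat\lambda_{K+1} - \hat\lambda_j)^{-2} \le v_n\}$, where $K = K_n$ is given as in Theorem 2.2 and $v_n \to \infty$. Then $L_n \xrightarrow{P} \infty$.

Proof. Let $r \in \mathbb{N}$ be such that for all $1 \le i \le r$ we have $\lambda_{r+1} \ne \lambda_i$. Note that $\sum_{j \ge 1} \lambda_j < \infty$ implies $\lambda_j \to 0$, and since $\lambda_j > 0$ we can find infinitely many $r$ satisfying this condition. We choose such $r$ and obtain

$$P(L_n < r) \le P\Big(\sum_{i=1}^{r} \frac{1}{(\hat\lambda_{K+1} - \hat\lambda_i)^2} > v_n \,\cap\, K > r\Big) + P(K \le r).$$

Lemma 4.8 implies that $P(K \le r) \to 0$. The first term is bounded by

$$P\Big(\sum_{i=1}^{r} \frac{1}{(\hat\lambda_{r+1} - \hat\lambda_i)^2} > v_n\Big).$$

Since $\hat\lambda_j \xrightarrow{P} \lambda_j$ and $r$ is fixed while $v_n \to \infty$, it follows that $P(L_n < r) \to 0$ as $n \to \infty$. Since $r$ can be chosen arbitrarily large, the proof is finished. □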
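The quantity $L_n$ in Lemma 4.11 is the largest $r \le K$ for which the cumulative sum $\sum_{j \le r} (\hat\lambda_{K+1} - \hat\lambda_j)^{-2}$ stays below $v_n$. A minimal sketch of how it could be computed from a list of empirical eigenvalues is given below. The function name, the 0-based list layout, and the convention of returning 0 when no $r$ qualifies are assumptions made for this illustration and do not come from the paper.

```python
def truncation_level(lam_hat, K, v_n):
    """Largest r <= K with sum_{j<=r} (lam_hat_{K+1} - lam_hat_j)^(-2) <= v_n.

    lam_hat : nonincreasing sequence, lam_hat[0] is the largest eigenvalue.
    Returns 0 if even r = 1 violates the constraint (illustrative convention).
    """
    if not 1 <= K < len(lam_hat):
        raise ValueError("need 1 <= K < len(lam_hat) so that lam_hat_{K+1} exists")
    lam_next = lam_hat[K]              # lam_hat_{K+1} in 1-based notation
    total = 0.0
    level = 0
    for r in range(1, K + 1):
        gap = lam_next - lam_hat[r - 1]
        if gap == 0.0:                 # tied eigenvalues make the term infinite
            break
        total += 1.0 / gap ** 2
        if total > v_n:                # partial sums only grow, so we can stop
            break
        level = r
    return level
```

For instance, with `lam_hat = [1.0, 0.5, 0.25, 0.125]` and `K = 3` the partial sums are roughly 1.31, 8.42 and 72.42, so `v_n = 10` gives level 2 and `v_n = 100` gives level 3. Because the summands are positive, the early exit is equivalent to scanning all $r \le K$.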



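Lemma 4.12. Let $\Psi$ and $X$ be defined as in Lemma 2.2. Then $\|P_{M_K}(X) - P_{\hat M_K}(X)\|_{H_1} \xrightarrow{P} 0$.

Proof. Let us define two variables $X^{(1)} = \sum_{i=1}^{L} \langle X, v_i\rangle_{H_1} v_i$ and $X^{(2)} = \sum_{i>L} \langle X, v_i\rangle_{H_1} v_i$, with $L$ as in Lemma 4.11. Again for simplifying the notation we will write $L$ instead of $L_n$. Since $X = X^{(1)} + X^{(2)}$ we derive

$$\|P_{M_K}(X) - P_{\hat M_K}(X)\|_{H_1} \le \|P_{M_K}(X^{(1)}) - P_{\hat M_K}(X^{(1)})\|_{H_1} + \|P_{M_K}(X^{(2)})\|_{H_1} + \|P_{\hat M_K}(X^{(2)})\|_{H_1}. \qquad (4.10)$$

The last two terms are bounded by $2\|X^{(2)}\|_{H_1}$. For the first summand in (4.10) we get

$$\|P_{M_K}(X^{(1)}) - P_{\hat M_K}(X^{(1)})\|_{H_1} = \Big\|\sum_{i=1}^{L} \langle X, v_i\rangle_{H_1} \big(v_i - P_{\hat M_K}(v_i)\big)\Big\|_{H_1}.$$

Let us choose $v_n = o(n)$ in Lemma 4.11. The triangle inequality, the Cauchy-Schwarz inequality, Lemma 4.9 and the definition of $L$ entail

$$\|P_{M_K}(X^{(1)}) - P_{\hat M_K}(X^{(1)})\|_{H_1} \le \sum_{i=1}^{L} |\langle X, v_i\rangle_{H_1}|\,\|v_i - P_{\hat M_K}(v_i)\|_{H_1}$$

$$\le \Big(\sum_{i=1}^{L} |\langle X, v_i\rangle_{H_1}|^2\Big)^{1/2} \Big(\sum_{i=1}^{L} \|v_i - P_{\hat M_K}(v_i)\|_{H_1}^2\Big)^{1/2}$$

$$\le \|X\|_{H_1} \Big(\sum_{i=1}^{L} \|v_i - P_{\hat M_K}(v_i)\|_{H_1}^2\Big)^{1/2}$$

$$\le 2\|X\|_{H_1}\,\|\hat C - C\|_{\mathcal{L}_{11}} \Big(\sum_{i=1}^{L} \frac{1}{(\hat\lambda_{K+1} - \hat\lambda_i)^2}\Big)^{1/2}$$

$$\le 2\|X\|_{H_1}\,\|\hat C - C\|_{\mathcal{L}_{11}} \sqrt{v_n}.$$

This implies the inequality

$$\|P_{M_K}(X) - P_{\hat M_K}(X)\|_{H_1} \le 2\|X\|_{H_1}\,\|\hat C - C\|_{\mathcal{L}_{11}} \sqrt{v_n} + 2\|X^{(2)}\|_{H_1}. \qquad (4.11)$$

Hence by Lemma 4.1 we have $2\|X\|_{H_1}\,\|\hat C - C\|_{\mathcal{L}_{11}} \sqrt{v_n} = o_P(1)$. Furthermore we have that $\|X^{(2)}\|_{H_1} = \big(\sum_{i>L} |\langle X, v_i\rangle_{H_1}|^2\big)^{1/2} \xrightarrow{P} 0$. This follows from the proof of Lemma 4.10.

□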



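Lemma 4.13. Let $\Psi$ and $X$ be defined as in Lemma 2.2. Then $\|\Psi(P_{\hat M_K^{\perp}}(X))\|_{H_2} \xrightarrow{P} 0$.

Proof. Some simple manipulations show

$$\|\Psi(P_{\hat M_K^{\perp}}(X))\|_{H_2} = \|\Psi(X - P_{\hat M_K}(X))\|_{H_2}$$

$$= \|\Psi(P_{M_K}(X) + P_{M_K^{\perp}}(X) - P_{\hat M_K}(X))\|_{H_2}$$

$$\le \|\Psi(P_{M_K}(X)) - \Psi(P_{\hat M_K}(X))\|_{H_2} + \|\Psi(P_{M_K^{\perp}}(X))\|_{H_2}$$

$$\le \|\Psi\|_{\mathcal{L}_{12}} \Big(\|P_{M_K}(X) - P_{\hat M_K}(X)\|_{H_1} + \|P_{M_K^{\perp}}(X)\|_{H_1}\Big).$$

Direct applications of Lemma 4.10 and Lemma 4.12 finish the proof.

□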



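Proof of Theorem 2.2. Set

$$\Theta_n(x) = \sum_{j=1}^{K_n} \frac{\Lambda(\hat v_j)}{\hat\lambda_j} \langle \hat v_j, x\rangle_{H_1}.$$

By the representation (4.9) and the triangle inequality

$$\|\Psi(X) - \hat\Psi_K(X)\|_{H_2} \le \|\Theta_n(X)\|_{H_2} + \|\Psi(P_{\hat M_K^{\perp}}(X))\|_{H_2}.$$

Lemma 4.13 shows that the second term tends to zero in probability. If in Lemma 4.1 we define $\Psi = 0$, then $\hat\Delta = \Lambda$ and by independence of $\varepsilon_k$ and $X_k$ we get $\Delta = 0$. By the arguments of Lemma 4.5 we infer that $P(\|\Theta_n\|_{\mathcal{L}_{12}} > \varepsilon) \to 0$, which implies that $\|\Theta_n(X)\|_{H_2} \xrightarrow{P} 0$. □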
Appendix

n       K_n^OPT  NMSE    K_n^1    NMSE    K_n^0.5  NMSE    K_n^0.1  NMSE
10      1        3.26    2        3.50    1        3.26    1        3.26
20      1        1.38    3        2.59    2        1.88    1        1.38
40      1        1.14    4        1.73    3        1.29    1        1.14
80      3        0.77    6        1.58    4        1.05    1        0.80
160     5        0.48    6        0.62    5        0.48    1        0.73
320     7        0.31    7        0.31    6        0.36    2        0.57
640     8        0.18    9        0.19    7        0.19    3        0.33
1280    11       0.11    9        0.11    8        0.12    4        0.25
2560    11       0.06    10       0.07    9        0.08    5        0.17
5120    15       0.03    11       0.04    9        0.05    6        0.10
10240   17       0.02    12       0.02    10       0.03    6        0.10
20480   17       0.01    13       0.01    11       0.02    7        0.06

Table 1: $\Psi_1$, $\Lambda_1$, $\sigma = 1$








n       K_n^OPT  NMSE    K_n^1    NMSE    K_n^0.5  NMSE    K_n^0.1  NMSE
20      4        0.96    12       1.85    9        1.46    1        1.01
80      12       0.68    20       0.93    16       0.79    3        0.85
320     20       0.25    27       0.30    23       0.27    9        0.50
1280    29       0.07    30       0.08    27       0.08    17       0.19
5120    35       0.02    31       0.02    30       0.02    23       0.06
20480   34               33       0.01    31       0.01    26       0.02

Table 2: $\Psi_1$, $\Lambda_2$, $\sigma = 1$








n       K_n^OPT  NMSE    K_n^1    NMSE    K_n^0.5  NMSE    K_n^0.1  NMSE
20      7        0.90    18       1.39    16       1.03    5        0.93
80      28       0.58    35       0.88    32       0.62    14       0.71
320     35       0.12    35       0.12    35       0.12    33       0.16
1280    35       0.03    35       0.03    35       0.03    35       0.03
5120    35       0.01    35       0.01    35       0.01    35       0.01
20480   35               35               35               35

Table 3: $\Psi_1$, $\Lambda_3$, $\sigma = 1$
n       K_n^OPT  NMSE    K_n^1    NMSE    K_n^0.5  NMSE    K_n^0.1  NMSE
20      1        2       3        3.92    2        2.68    1        2
80      1        0.90    6        1.82    4        1.18    1        0.90
320     4        0.45    8        0.70    6        0.49    2        0.57
1280    7        0.16    9        0.17    8        0.18    4        0.27
5120    11       0.05    11       0.05    9        0.07    5        0.17
20480   17       0.02    13       0.02    11       0.02    7        0.06

Table 4: $\Psi_2$, $\Lambda_1$, $\sigma = 1$


n       K_n^OPT  NMSE    K_n^1    NMSE    K_n^0.5  NMSE    K_n^0.1  NMSE
20      1        1.08    11       2.54    9        1.97    1        1.08
80      7        0.86    21       1.68    17       1.34    3        0.92
320     20       0.40    27       0.51    24       0.43    9        0.54
1280    28       0.12    30       0.14    27       0.13    17       0.23
5120    32       0.03    31       0.03    30       0.04    22       0.08
20480   34       0.01    33       0.01    31       0.01    26       0.02

Table 5: $\Psi_2$, $\Lambda_2$, $\sigma = 1$


n       K_n^OPT  NMSE    K_n^1    NMSE    K_n^0.5  NMSE    K_n^0.1  NMSE
20      6        0.97    18       2.16    16       1.65    5        1.01
80      22       0.75    35       1.49    32       1.12    14       0.83
320     35       0.22    35       0.22    35       0.22    33       0.25
1280    35       0.05    35       0.05    35       0.05    35       0.05
5120    35       0.01    35       0.01    35       0.01    35       0.01
20480   35               35               35               35

Table 6: $\Psi_2$, $\Lambda_3$, $\sigma = 1$


n       K_n^OPT  NMSE    K_n^1    NMSE    K_n^0.5  NMSE    K_n^0.1  NMSE
20      1        35.28   3        77.60   2        68.64   1        35.28
80      1        16.90   6        71.07   4        39.40   1        16.90
320     1        1.93    7        15.85   6        13.28   2        2.87
1280    1        1.09    9        4.67    8        3.46    4        2.06
5120    3        0.64    11       1.22    9        1.18    6        0.83
20480   6        0.24    13       0.42    11       0.33    7        0.26

Table 7: $\Psi_3$, $\Lambda_1$, $\sigma = 1$






n       K_n^OPT  NMSE    K_n^1    NMSE    K_n^0.5  NMSE    K_n^0.1  NMSE
20      1        2.39    12       61.32   9        40.39   1        2.39
80      1        1.50    20       34.95   16       27.85   3        3.87
320     1        1.15    27       10.96   23       9.81    9        4.35
1280    3        1.04    30       3.20    27       2.66    17       2.01
5120    17       0.53    31       0.74    30       0.73    23       0.64
20480   29       0.18    33       0.21    31       0.20    26       0.19

Table 8: $\Psi_3$, $\Lambda_2$, $\sigma = 1$








n       K_n^OPT  NMSE    K_n^1    NMSE    K_n^0.5  NMSE    K_n^0.1  NMSE
20      1        1.33    18       47.52   16       27.09   5        4.04
80      1        1.20    35       36.28   32       26.53   14       5.84
320     1        1.05    35       5.24    35       5.24    33       4.50
1280    8        0.96    35       1.22    35       1.22    35       1.22
5120    34       0.28    35       0.28    35       0.28    35       0.28
20480   35       0.08    35       0.08    35       0.08    35       0.08

Table 9: $\Psi_3$, $\Lambda_3$, $\sigma = 1$